# haz_2005 and haz_2016
# previously filtered the full data set down by year:
# haz_2005 <- haz_data %>%
#   filter(year == 2005)
# haz_2016 <- haz_data %>%
#   filter(year == 2016)

# import the new, smaller, more efficient data frames
haz_2005 <- read_csv("haz_2005.csv")
haz_2016 <- read_csv("haz_2016.csv")
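The one-time trimming step that produced these smaller CSVs might look like the following sketch; `haz_data` and its `year` column are taken from the commented `filter()` calls above, and the source filename is a placeholder assumption.

```r
library(dplyr)
library(readr)

# one-time step: read the full EPA file (~8 million rows) and write
# per-year subsets; "hap_daily_summary.csv" is a hypothetical filename
haz_data <- read_csv("hap_daily_summary.csv")

haz_data %>% filter(year == 2005) %>% write_csv("haz_2005.csv")
haz_data %>% filter(year == 2016) %>% write_csv("haz_2016.csv")
```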

Introduction

The purpose of my analysis is to understand how chemical pollutant concentrations in the atmosphere, specifically carcinogens, affect cancer rates in the population. For my analysis, I used two data sets. The first is the “Hazardous Air Pollutants” data set from the EPA, which measures the concentration of different chemical pollutants on 24-hour intervals at multiple measuring sites in each state. The initial CSV file had over eight million observations and required significant initial trimming and wrangling to get it into a format that my computer and I could reasonably handle. The other data set records cancer rates by year and state for the United States, and came directly from the CDC. This data set required less wrangling, but it was an enjoyable challenge to get the two data sets to work in tandem with one another and to test for correlation and causation between the two. I spent a significant amount of time exploring the data using a mapping package to compare spatial and temporal variations in both cancer rates and chemical pollutant concentrations, specifically carcinogenic chemicals. I also zoomed in on some chemicals I deemed of interest from my initial analysis (lead and benzene) and compared daily variations, over the period of a year, across specific states that had higher and lower cancer rates. This analysis was followed by hypothesis testing that verified my qualitative understanding of how chemical concentrations varied by state; looking at how cancer rates varied, at least qualitatively, with these selected chemicals then led me to build a linear model. While the model didn’t end up being very predictive or powerful, the process gave insight into the complexity of how cancer is caused, and it was nonetheless an interesting exercise in bringing two large data sets together to try to understand how they may be related.

Ethical Discussion

This exploration and analysis of two complex but important and useful data sets represents, to me, the amount of information of this type that is accessible to the public. With just a little bit of searching and the knowledge of how to wrangle and understand data that I have learned in this course, I was able to reduce, model and comprehend the relationship between two data sets, one with over eight million observations. It’s important that this type of data is available to the public, and I think it’s valuable to spend time trying to understand and parse through information like this. In terms of stakeholders, I think that everybody in the world stands to benefit from an analysis like this. Parsing through data sets like these two, which relate chemical pollutants and cancer rates, is a useful and productive thing to do because it helps inform how the people doing this type of analysis for a living build their models of what causes cancer and what we can do to prevent it. Cancer is a horrible thing, and anything we can do to better understand it is valuable for everyone, because everyone is at risk.

Ethically, I don’t see any issues with doing this. The only people who may stand to lose are the polluters, as their company names are in the data set, and if a study such as this one found that they were doing harm, they may find themselves buried in regulation, which I would argue wouldn’t be such a bad thing. There’s no identification of individuals in the analysis, and I find this project to be the least ethically dubious of everything I have done this semester. I don’t see anything of value being lost except by companies polluting the atmosphere, which is arguably an ethically dubious thing to do in the first place, so if I were to identify particularly prolific contributors, I don’t find anything wrong with singling them out.

Exploratory Figures

Histograms

For my analysis, I chose to focus initially on the years 2005 and 2016, but then finally dialed in on 2005, as this made the data sets more manageable and the magnitude of the work more in line with the amount of time I have for this project.

The following histograms look at U.S. average concentrations of all chemicals measured by the EPA data set in their respective units, which are µg/m^3, ng/m^3 and parts per billion carbon (ppbc). The histograms show averages from both 2005 and 2016 to illustrate overall temporal changes in United States emissions. This is by no means a complete picture. As will become clear later, specific states drastically affect these averages, and it is to these states that we look to begin to qualitatively understand how cancer rates are modulated by pollutant concentrations.
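A summary frame like `microgram_sum_2005` could plausibly be built as sketched below; the column names follow those used in later chunks (`parameter_name`, `arithmetic_mean`, `units_of_measure`), but the exact unit string is an assumption about the EPA data set.

```r
library(dplyr)

# average each microgram-unit chemical across all 2005 measurements;
# the filter string is an assumed value of units_of_measure
microgram_sum_2005 <- haz_2005 %>%
  filter(units_of_measure == "Micrograms/cubic meter (LC)") %>%
  group_by(parameter_name) %>%
  summarize(mean = mean(arithmetic_mean, na.rm = TRUE))
```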

# histograms of different chemicals in atmosphere in 2005
ggplot(microgram_sum_2005, aes(x = parameter_name, y = mean))+
  geom_col(color = "black", fill = "seagreen4")+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust= 0.5))+
  labs(
    title = "Chemicals with Concentration in Micrograms in 2005",
    y = expression("Micrograms per Cubic Meter ("*mu*"g/m^3)")
  )+
  theme(axis.title.x = element_blank(), plot.title = element_text(hjust = 0.5))+
  theme(axis.ticks.x = element_blank())+
  scale_y_continuous(limits = c(0.000, 0.0042))

# histograms of different chemicals in atmosphere in 2016
ggplot(microgram_sum_2016, aes(x = parameter_name, y = mean))+
  geom_col(color = "black", fill = "seagreen4")+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust= 0.5))+
  labs(
    title = "Chemicals with Concentration in Micrograms in 2016",
    y = expression("Micrograms per Cubic Meter ("*mu*"g/m^3)")
  )+
  theme(axis.title.x = element_blank(), plot.title = element_text(hjust = 0.5))+
  theme(axis.ticks.x = element_blank())+
  scale_y_continuous(limits = c(0.000, 0.0042))

Figure 1: Concentrations of chemicals measured in µg/m^3 for both 2005 and 2016.

There is an evident overall change in concentrations from 2005 to 2016, with a decrease in the average of each chemical, the exception being cadmium, which is explored further later on.

# histograms of different chemicals in atmosphere in 2005
ggplot(nanogram_sum_2005, aes(x = parameter_name, y = mean))+
  geom_col(color = "black", fill = "turquoise4")+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust= 0.5))+
  labs(
    title = "Chemicals with Concentration in Nanograms in 2005",
    y = "Nanograms per Cubic Meter (ng/m^3)"
  )+
  theme(axis.title.x = element_blank(), plot.title = element_text(hjust = 0.5))+
  theme(axis.ticks.x = element_blank())+
  ylim(0,16)

# histograms of different chemicals in atmosphere in 2016
ggplot(nanogram_sum_2016, aes(x = parameter_name, y = mean))+
  geom_col(color = "black", fill = "turquoise4")+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust= 0.5))+
  labs(
    title = "Chemicals with Concentration in Nanograms in 2016",
    y = "Nanograms per Cubic Meter (ng/m^3)"
  )+
  theme(axis.title.x = element_blank(), plot.title = element_text(hjust = 0.5))+
  theme(axis.ticks.x = element_blank())+
  ylim(0,16)

Figure 2: Concentrations of chemicals measured in ng/m^3 for both 2005 and 2016.

There isn’t a dramatic change in the chemicals measured in nanograms from 2005 to 2016. While there are minute decreases in some chemicals and increases in others, the relative magnitude of each bar doesn’t change dramatically.

# histograms of different chemicals in atmosphere in 2005
ggplot(ppbc_sum_2005, aes(x = parameter_name, y = mean))+
  geom_col(color = "black", fill = "olivedrab3")+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust= 0.5))+
  labs(
    title = "Chemicals with Concentration in PPBC in 2005",
    y = "Parts per Billion Carbon"
  )+
  theme(axis.title.x = element_blank(), plot.title = element_text(hjust = 0.5))+
  theme(axis.ticks.x = element_blank())+
  ylim(0,3.5)

# histograms of different chemicals in atmosphere in 2016
ggplot(ppbc_sum_2016, aes(x = parameter_name, y = mean))+
  geom_col(color = "black", fill = "olivedrab3")+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust= 0.5))+
  labs(
    title = "Chemicals with Concentration in PPBC in 2016",
    y = "Parts per Billion Carbon"
  )+
  theme(axis.title.x = element_blank(), plot.title = element_text(hjust = 0.5))+
  theme(axis.ticks.x = element_blank())+
  ylim(0,3.5)

Figure 3: Concentrations of chemicals measured in parts per billion carbon for both 2005 and 2016.

As with the chemicals measured in micrograms, there is an overall muting of this plot from 2005 to 2016. Of all the chemicals measured in ppbc, formaldehyde and acetaldehyde tend to have the highest concentrations.

Maps

Spatial mapping provides a unique perspective on how cancer rates and chemical concentrations have changed both spatially and temporally. The following maps show, most importantly, cancer rates, but also the concentrations of benzene, lead, vinyl chloride and cadmium, all known carcinogens. They plot the state averages in both 2005 and 2016 to show temporal variation. These plots informed a qualitative analysis that led to a more quantitative analysis later on.
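The state-level averaging behind these maps could be sketched as follows; `plot_usmap()` joins its `data` argument on a column named `state`, and `state_name` is an assumed column in the EPA data set.

```r
library(dplyr)

# one state-average frame per chemical and year; plot_usmap() expects
# a "state" column plus the value column named in `values`
ppbc_benzene_2005 <- haz_2005 %>%
  filter(parameter_name == "Benzene") %>%
  group_by(state = state_name) %>%
  summarize(mean = mean(arithmetic_mean, na.rm = TRUE))
```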

# map of 2005 cancer rates in the US
plot_usmap(data = cancer_rates_2005, values = "RATE", color = "black") + 
  scale_fill_gradient(name = "Cancer Rate (Cases Per 100,000)", limits = range(cancer_rates_2005$RATE, cancer_rates_2016$RATE), low = "white", high ="darkblue",label = scales::comma) + 
  theme(legend.position = "right")+
  labs(
    title = "US Cancer Rates per 100,000 in 2005",
  )+
  theme(plot.title = element_text(hjust = 0.5))

# map of 2016 cancer rates in the US
plot_usmap(data = cancer_rates_2016, values = "RATE", color = "black") + 
  scale_fill_gradient(name = "Cancer Rate (Cases Per 100,000)", limits = range(cancer_rates_2005$RATE, cancer_rates_2016$RATE), low = "white", high ="darkblue",label = scales::comma) + 
  theme(legend.position = "right")+
  labs(
    title = "US Cancer Rates per 100,000 in 2016",
  )+
  theme(plot.title = element_text(hjust = 0.5))

Figure 4: Map of U.S. cancer rates, by state in 2005 and 2016.

On its own, this plot is incredibly reassuring, as the rate of cancer seems to have decreased significantly from 2005 to 2016. I hope that through this analysis I can begin to piece together whether changes in known carcinogen concentrations in the air have helped modulate this change.

# map of benzene ppbc concentrations in 2005
plot_usmap(data = ppbc_benzene_2005, values = "mean", color = "black") + 
  scale_fill_gradient(name = expression("Concentration (ppbc)"), limit = range(ppbc_benzene_2005$mean, ppbc_benzene_2016$mean), low = "steelblue3", high = "violetred2",label = scales::comma) + 
  theme(legend.position = "right")+
  labs(
    title = "Benzene Concentrations in 2005 by State",
  )+
  theme(plot.title = element_text(hjust = 0.5))

# map of benzene ppbc concentrations in 2016
plot_usmap(data = ppbc_benzene_2016, values = "mean", color = "black") + 
  scale_fill_gradient(name = expression("Concentration (ppbc)"), limit = range(ppbc_benzene_2005$mean, ppbc_benzene_2016$mean), low = "steelblue3", high = "violetred2",label = scales::comma) + 
  theme(legend.position = "right")+
  labs(
    title = "Benzene Concentrations in 2016 by State",
  )+
  theme(plot.title = element_text(hjust = 0.5))

Figure 5: Map of Benzene concentrations, by state in 2005 and 2016.

Many states don’t have measurements from both years, but there has been a clear, qualitative decrease in the amount of benzene present in the atmosphere. That said, the amount present in 2016 is certainly not insignificant.

# map of lead microgram concentrations in 2005
plot_usmap(data = microgram_lead_2005, values = "mean", color = "black") + 
  scale_fill_gradient(name = expression("Concentration ("*mu*"g)"), limits = range(microgram_lead_2005$mean, microgram_lead_2016$mean), low = "yellow3", high = "Red2", label = scales::comma) + 
  theme(legend.position = "right")+
  labs(
    title = "Lead Concentrations in 2005 by State",
  )+
  theme(plot.title = element_text(hjust = 0.5))

# map of lead microgram concentrations in 2016
plot_usmap(data = microgram_lead_2016, values = "mean", color = "black") + 
   scale_fill_gradient(name = expression("Concentration ("*mu*"g)"), limits = range(microgram_lead_2005$mean, microgram_lead_2016$mean), low = "yellow3", high = "Red2", label = scales::comma) + 
  theme(legend.position = "right")+
  labs(
    title = "Lead Concentrations in 2016 by State",
  )+
  theme(plot.title = element_text(hjust = 0.5))

Figure 6: Map of Lead PM2.5 LC concentrations, by state in 2005 and 2016.

The states with the most lead present saw a decrease from 2005 to 2016 (specifically Alabama and Ohio), but other states, especially in the west, saw an increase in concentrations, so the overall amount, at least qualitatively, has not decreased but rather spread out.

# map of vinyl chloride ppbc concentrations in 2005
plot_usmap(data = ppbc_vinyl_chloride_2005, values = "mean", color = "black") + 
   scale_fill_gradient(name = expression("Concentration ppbc"), limits = range(ppbc_vinyl_chloride_2005$mean, ppbc_vinyl_chloride_2016$mean), low = "white", high = "seagreen4", label = scales::comma) + 
  theme(legend.position = "right")+
  labs(
    title = "Vinyl Chloride Concentrations in 2005 by State",
  )+
  theme(plot.title = element_text(hjust = 0.5))

# map of vinyl chloride ppbc concentrations in 2016
plot_usmap(data = ppbc_vinyl_chloride_2016, values = "mean", color = "black") + 
   scale_fill_gradient(name = expression("Concentration ppbc"), limits = range(ppbc_vinyl_chloride_2005$mean, ppbc_vinyl_chloride_2016$mean), low = "white", high = "seagreen4", label = scales::comma) + 
  theme(legend.position = "right")+
  labs(
    title = "Vinyl Chloride Concentrations in 2016 by State",
  )+
  theme(plot.title = element_text(hjust = 0.5))

Figure 7: Map of Vinyl Chloride concentrations, by state in 2005 and 2016.

Overall, vinyl chloride concentrations have decreased in the states with available measurements. Kentucky, Washington and Florida appear to be chronic contributors to emissions.

# map of cadmium microgram concentrations in 2005
plot_usmap(data = microgram_cadmium_2005, values = "mean", color = "black") + 
  scale_fill_gradient(name = expression("Concentration ("*mu*"g)"), limits = range(microgram_cadmium_2005$mean, microgram_cadmium_2016$mean), low = "white", high = "turquoise4", label = scales::comma) + 
  theme(legend.position = "right")+
  labs(
    title = "Cadmium Concentrations in 2005 by State",
  )+
  theme(plot.title = element_text(hjust = 0.5))

# map of cadmium microgram concentrations in 2016
plot_usmap(data = microgram_cadmium_2016, values = "mean", color = "black") + 
  scale_fill_gradient(name = expression("Concentration ("*mu*"g)"), limits = range(microgram_cadmium_2005$mean, microgram_cadmium_2016$mean), low = "white", high = "turquoise4", label = scales::comma) + 
  theme(legend.position = "right")+
  labs(
    title = "Cadmium Concentrations in 2016 by State",
  )+
  theme(plot.title = element_text(hjust = 0.5))

Figure 8: Map of Cadmium PM2.5 LC concentrations, by state in 2005 and 2016.

Cadmium has also seen an overall decrease in concentrations, but the amount still present is fairly concerning; future studies should identify the primary contributors in each state and work to drastically decrease emissions.

Throughout these plots, I see an evident overall decrease in chemical concentrations and cancer rates from 2005 to 2016, and some specific states with anomalously low and high cancer rates and emission quantities. I qualitatively infer a correlation between chemical concentration and rates of cancer and predict that further model building and hypothesis testing will yield the same result.

Exploratory Plots of Specific Chemical Changes by Day in States With Higher and Lower Cancer Rates

After identifying Ohio and Alabama as having relatively high cancer rates and carcinogen concentrations, and Minnesota and Utah as being at the lower end of the spectrum, I decided to plot them against each other to qualitatively inform what my t-tests will quantitatively tell me later on. I chose benzene and lead, as they are known carcinogens and, at least qualitatively, the cancer rates seemed to loosely line up with their concentrations in 2005.
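The per-state daily series used in the plots below could be subset as in this sketch; `state_name` is an assumed column in the EPA data set.

```r
library(dplyr)

# one state, one chemical, daily values through 2005;
# date_local and arithmetic_mean are the columns plotted below
alabama_2005_benzene <- haz_2005 %>%
  filter(state_name == "Alabama", parameter_name == "Benzene") %>%
  select(date_local, arithmetic_mean)
```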

# alabama benzene 2005 plot
ggplot(alabama_2005_benzene, aes(x = date_local, y = arithmetic_mean, color = ifelse(arithmetic_mean >= 5, "violetred2", "steelblue3")))+
  geom_point()+
  theme_bw()+
  labs(
    title = "Alabama Benzene Concentrations Through 2005",
    x = "Time",
    y = expression("Benzene Concentrations (ppbc)")
  )+
  theme(plot.title = element_text(hjust = 0.5))+
  ylim(0,100)+
  theme(legend.position = "none")+
  scale_color_manual(values = c("steelblue3", "violetred2"))

# ohio benzene
ggplot(ohio_benzene_2005, aes(x = date_local, y = arithmetic_mean, color = ifelse(arithmetic_mean >= 5, "violetred2", "steelblue3")))+
  geom_point()+
  theme_bw()+
  labs(
    title = "Ohio Benzene Concentrations Through 2005",
    x = "Time",
    y = expression("Benzene Concentrations (ppbc)")
  )+
  theme(plot.title = element_text(hjust = 0.5))+
  ylim(0,100)+
  theme(legend.position = "none")+
  scale_color_manual(values = c("steelblue3", "violetred2"))

# Utah benzene
ggplot(utah_2005_benzene, aes(x = date_local, y = arithmetic_mean, color = ifelse(arithmetic_mean >= 5, "violetred2", "steelblue3")))+
  geom_point()+
  theme_bw()+
  labs(
    title = "Utah Benzene Concentrations Through 2005",
    x = "Time",
    y = expression("Benzene Concentrations (ppbc)")
  )+
  theme(plot.title = element_text(hjust = 0.5))+
  ylim(0,100)+
  theme(legend.position = "none")+
  scale_color_manual(values = c("steelblue3", "violetred2"))

# minnesota benzene
ggplot(minnesota_2005_benzene, aes(x = date_local, y = arithmetic_mean, color = ifelse(arithmetic_mean >= 5, "violetred2", "steelblue3")))+
  geom_point()+
  theme_bw()+
  labs(
    title = "Minnesota Benzene Concentrations Through 2005",
    x = "Time",
    y = expression("Benzene Concentrations (ppbc)")
  )+
  theme(plot.title = element_text(hjust = 0.5))+
  ylim(0,100)+
  theme(legend.position = "none")+
  scale_color_manual(values = c("steelblue3", "violetred2"))

Figure 9: Plot of benzene concentrations through 2005 in Alabama, Ohio, Minnesota and Utah. Points above a threshold concentration change color, to qualitatively show whether a state tended to have anomalously high values, or had higher concentrations more often, or with more magnitude, than another state.

There is a clear qualitative difference in benzene concentrations, when comparing Alabama/Ohio to Minnesota/Utah. I will quantify this difference in the next section.

# alabama lead 2005 plot
ggplot(alabama_2005_lead, aes(x = date_local, y = arithmetic_mean, color = ifelse(arithmetic_mean >= 0.025, "violetred2", "steelblue3")))+
  geom_point()+
  theme_bw()+
  labs(
    title = "Alabama Lead Concentrations Through 2005",
    x = "Time",
    y = expression("Lead Concentrations (" *mu* "g/m" ^3* ")")
  )+
  theme(plot.title = element_text(hjust = 0.5))+
  ylim(0,0.5)+
  theme(legend.position = "none")+
  scale_color_manual(values = c("yellow3", "red2"))

# ohio lead 2005 plot
ggplot(ohio_lead_2005, aes(x = date_local, y = arithmetic_mean, color = ifelse(arithmetic_mean >= 0.025, "violetred2", "steelblue3")))+
  geom_point()+
  theme_bw()+
  labs(
    title = "Ohio Lead Concentrations Through 2005",
    x = "Time",
    y = expression("Lead Concentrations (" *mu* "g/m" ^3* ")")
  )+
  theme(plot.title = element_text(hjust = 0.5))+
  ylim(0,0.5)+
  theme(legend.position = "none")+
  scale_color_manual(values = c("yellow3", "red2"))

# utah lead 2005 plot
ggplot(utah_2005_lead, aes(x = date_local, y = arithmetic_mean, color = ifelse(arithmetic_mean >= 0.025, "violetred2", "steelblue3")))+
  geom_point()+
  theme_bw()+
  labs(
    title = "Utah Lead Concentrations Through 2005",
    x = "Time",
    y = expression("Lead Concentrations (" *mu* "g/m" ^3* ")")
  )+
  theme(plot.title = element_text(hjust = 0.5))+
  ylim(0,0.5)+
  theme(legend.position = "none")+
  scale_color_manual(values = c("yellow3", "red2"))

# minnesota lead 2005 plot
ggplot(minnesota_2005_lead, aes(x = date_local, y = arithmetic_mean, color = ifelse(arithmetic_mean >= 0.025, "violetred2", "steelblue3")))+
  geom_point()+
  theme_bw()+
  labs(
    title = "Minnesota Lead Concentrations Through 2005",
    x = "Time",
    y = expression("Lead Concentrations (" *mu* "g/m" ^3* ")")
  )+
  theme(plot.title = element_text(hjust = 0.5))+
  ylim(0,0.5)+
  theme(legend.position = "none")+
  scale_color_manual(values = c("yellow3", "red2"))

Figure 10: Plot of lead concentrations through 2005 in Alabama, Ohio, Minnesota and Utah. Points above a threshold concentration change color, to qualitatively show whether a state tended to have anomalously high values, or had higher concentrations more often, or with more magnitude, than another state.

Through 2005, there is a blatant qualitative difference between lead concentrations in Alabama/Ohio and Minnesota/Utah. At first glance, the number of red dots in the first pair of states is much higher than in the second.

In 2005, data points representing relatively high values of benzene or lead are much more frequent in states that correspond to high rates of cancer, and this merits not only statistical tests to quantify the difference, but also a predictive model.

Statistical Testing

This section compares the mean benzene and lead concentrations between Alabama, Ohio, Minnesota and Utah using a series of t-tests. I hope to quantify the difference between the states I observed to have high and low cancer rates and chemical concentrations. Statistical evidence that there is in fact a difference in chemical concentrations between these states would validate the decision to build a model predicting cancer rates from chemical pollutants.
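One of the pairwise tests below could be constructed as in this sketch; the two input data frames are the daily state series plotted in Figure 9, and `t.test()` defaults to a Welch two-sample test with a 95% confidence interval.

```r
# Welch two-sample t-test comparing daily benzene means in two states;
# the resulting htest object is what broom::tidy() flattens below
benzene_alabama_ohio_test <- t.test(
  alabama_2005_benzene$arithmetic_mean,
  ohio_benzene_2005$arithmetic_mean
)
```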

# output benzene t-tests
benzene_t_test_output <- map_df(list(benzene_alabama_ohio_test, benzene_minnesota_utah_test, benzene_alabama_minnesota_test, benzene_alabama_utah_test, benzene_ohio_minnesota_test, benzene_ohio_utah_test), tidy)

# output as a table
benzene_comparison_names <- tibble(test_name = c("Alabama and Ohio", "Minnesota and Utah", "Alabama and Minnesota", "Alabama and Utah", "Ohio and Minnesota", "Ohio and Utah"))

# join comparison names to the benzene t-test output
benzene_t_test_output <- cbind(benzene_t_test_output, benzene_comparison_names)

pander(benzene_t_test_output[c("test_name", "estimate", "p.value", "conf.low", "conf.high")], caption = "Results of t-tests Comparing Benzene Concentrations in 2005")
Results of t-tests Comparing Benzene Concentrations in 2005

test_name                 estimate    p.value      conf.low    conf.high
Alabama and Ohio           2.856      0.001569      1.115       4.598
Minnesota and Utah        -1.445      3.397e-12    -1.813      -1.077
Alabama and Minnesota      4.808      7.09e-08      3.239       6.376
Alabama and Utah           3.363      8.71e-05      1.755       4.97
Ohio and Minnesota         1.951      1.112e-06     1.177       2.726
Ohio and Utah              0.5063     0.2438       -0.3463      1.359

Figure 11: T-tests comparing benzene concentrations between states in 2005.

My alpha value for this entire section is a maximum p-value of 0.01. The two pairs of states I used as proxies for high and low chemical concentrations both showed statistically significant differences within the pair. Alabama had an average concentration 2.856 ppbc higher than Ohio, verified by a p-value of 0.0016 and a confidence interval that does not cross 0. Minnesota has a lower average concentration than Utah, with a p-value of 3.397 × 10^-12 and a 95% confidence interval that does not cross 0. When contrasting states of higher and lower average concentrations and checking for statistical differences, three comparisons showed a significant difference and one did not. Comparisons between Alabama and Minnesota, Alabama and Utah, and Ohio and Minnesota all yielded p-values below my threshold alpha as well as confidence intervals not crossing 0. Surprisingly, the comparison between Ohio and Utah was not statistically significant, with a p-value of 0.24 and a confidence interval crossing 0. I would attribute this to a lack of data or to noise, as I think this is a strange outcome considering the qualitative analysis done on Figure 9.

# output lead t-tests
lead_t_test_output <- map_df(list(lead_alabama_ohio_test, lead_minnesota_utah_test, lead_alabama_minnesota_test, lead_alabama_utah_test, lead_ohio_minnesota_test, lead_ohio_utah_test), tidy)

# output as a table
lead_comparison_names <- tibble(test_name = c("Alabama and Ohio", "Minnesota and Utah", "Alabama and Minnesota", "Alabama and Utah", "Ohio and Minnesota", "Ohio and Utah"))

# join comparison names to the benzene t-test output
lead_t_test_output <- cbind(lead_t_test_output, lead_comparison_names)

pander(lead_t_test_output[c("test_name", "estimate", "p.value", "conf.low", "conf.high")], caption = "Results of t-tests Comparing Lead Concentrations in 2005")
Results of t-tests Comparing Lead Concentrations in 2005

test_name                 estimate     p.value      conf.low      conf.high
Alabama and Ohio          0.001865     0.06494     -0.0001157     0.003845
Minnesota and Utah        7.56e-05     0.661       -0.0002626     0.0004138
Alabama and Minnesota     0.009197     6.235e-22    0.007376      0.01102
Alabama and Utah          0.009273     6.203e-22    0.007435      0.01111
Ohio and Minnesota        0.007333     1.984e-61    0.00652       0.008145
Ohio and Utah             0.007408     6.652e-59    0.006559      0.008257

Figure 12: T-tests comparing lead concentrations between states in 2005.

This series of t-tests is very similar to those in Figure 11, but compares lead concentrations in the chosen states instead of benzene. Again, the alpha value here is 0.01. The two states chosen for qualitatively high concentrations were Alabama and Ohio. They are not statistically different, with a p-value of 0.06 and a 95% confidence interval crossing 0. The same is true of the test between the states with relatively low concentrations, Minnesota and Utah; the p-value here is even higher, 0.661, and the confidence interval also crosses 0. The next four comparisons, which test for differences between states with qualitatively high and low concentrations, all yield significant results. The p-values for all four are far below 0.01 and their confidence intervals do not cross 0. The difference between states with high and low lead concentrations is more clean-cut than with benzene, which may have to do with better and more frequent measurement, or simply a larger range of values.

Following these statistical tests, I feel confident that there is some sort of relationship between chemical concentrations and cancer rates, so I think it is a valid choice to build a model relating the two.

Model Building

Building upon my qualitative analyses of spatial distributions, I seek to quantify the relationship between chemical pollutants and cancer rates. For this model, I will use rates of cancer and chemical concentrations in 2005, which makes a pretty important assumption: chemical pollutants in 2005 likely did not modulate 2005 cancer rates, but rather those of later years. I am assuming that the proportion between cancer rates and chemical concentrations remains relatively steady through the years. I think this assumption is valid based on the qualitative differences in cancer rates and chemical concentrations between 2005 and 2016 explored in the exploratory maps section.

# ggpairs test of potential linear model predictors
ggpairs(carcinogens_2005, columns = c("parameter_name", "arithmetic_mean", "first_max_value", "units_of_measure"))

# remove first max value

Figure 13: ggpairs plot for potential variables I am considering using for my linear model. This shows me variables with collinearity that I should consider before constructing my model.

The four variables I considered for my model were: parameter_name, which is simply the name of the chemical pollutant for the selected carcinogens; arithmetic_mean, the mean value of a particular emission measured over a 24-hour period; first_max_value, the maximum value of emission measured during the 24-hour period; and units_of_measure, as some of the chemical concentrations are measured in different units depending on average quantity. Following the ggpairs test, I saw collinearity between the arithmetic mean and first max value, so I removed first_max_value from consideration and only input the other three variables into my initial model.

# create table
pander(cancer_lm_for_table[c("Estimate", "p_value", "2.5 %", "97.5 %")], caption = "Output for Linear Model Predicting Cancer") # p-value below R's minimum output, but adjusted R^2 is only ~0.016
Output for Linear Model Predicting Cancer

                     Estimate     p_value       2.5 %       97.5 %
(Intercept)          184.2        0              184          184.4
Cadmium PM10 STP     -0.05425     0.8713        -0.7108       0.6023
Cadmium PM2.5 LC     4.457        5.974e-206     4.173        4.742
Chromium PM10 STP    0.03851      0.9073        -0.6099       0.6869
Lead PM2.5 LC        -1.131       1.435e-20     -1.37        -0.8929
Mercury PM10 STP     7.446        3.655e-14      5.519        9.374
Nickel PM10 STP      0.2231       0.5011        -0.4269       0.8731
Nickel PM2.5 LC      -1.132       1.354e-20     -1.371       -0.8937
Vinyl chloride       -1.242       2.45e-14      -1.562       -0.9229
arithmetic_mean      -0.06568     2.011e-15     -0.08189     -0.04947

Figure 14: Output of linear model cleaned up a bit for visual purposes (check code).

After running my three potential variables through a linear model predicting cancer rates in 2005, I ran it through stepwise selection using the Akaike information criterion (AIC) to balance model complexity against the number of variables. The procedure removed units_of_measure, as it didn’t add enough to the model to merit keeping it. The p-value of the final model was less than R’s minimum output, but the adjusted R^2 was 0.01637, indicating that only 1.637% of the scatter in my data was explained by the model. This is quite disconcerting, as it tells me that my model’s predictive ability is essentially non-existent. Despite that, I will run through an analysis of my predictors as well as some model diagnostics.
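The fit-then-simplify procedure described above could be sketched as follows; the join that puts the CDC RATE column onto `carcinogens_2005` is assumed to have happened during wrangling.

```r
# full model on the three candidate predictors from the ggpairs step
cancer_lm_full <- lm(
  RATE ~ parameter_name + arithmetic_mean + units_of_measure,
  data = carcinogens_2005
)

# stepwise AIC drops terms whose removal lowers AIC;
# here it removed units_of_measure
cancer_lm_2005 <- step(cancer_lm_full, trace = FALSE)
summary(cancer_lm_2005)$adj.r.squared  # ~0.016
```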

The arithmetic mean value of a substance turned out to be a good predictor, with a p-value far below the alpha value of 0.01 and a confidence interval not crossing 0. Cadmium measured in micrograms, lead measured in micrograms, mercury, nickel measured in micrograms and vinyl chloride all turned out to have statistically significant coefficients, with p-values far below the threshold alpha and confidence intervals not crossing 0. The model was also quite sure of the intercept, with a p-value that was essentially 0. The estimated base cancer rate was 184.2 per 100,000, which is probably the most useful thing to come out of this model.

# histogram of model residuals
ggplot(mapping = aes(x = cancer_lm_2005$residuals))+
  geom_histogram(color = "Black", fill = "darkslategray3")+
  theme_bw()+
  xlim(-50,50)+
  labs(
    title = "Residual Distribution for 2005 Cancer Model",
    x = "Residuals",
    y = "Frequency"
  )+
  theme(plot.title = element_text(hjust = 0.5))

Figure 15: Residual distribution for the multivariate linear model predicting cancer rates from known carcinogens.

This distribution is not perfect, but it is fairly normal and roughly centered around 0. The actual mean seems to be somewhere around -10, and the distribution is multi-modal, but that is arguably because it comes from a real data set. The residuals at the lower end that have no counterparts at the higher end are a bit concerning, but they aren’t numerous enough to really cause alarm. This distribution looks pretty good, but in an ideal world I would look for a more normal distribution with a center closer to 0.

# model diagnostics (autoplot() for lm objects comes from the ggfortify package)
autoplot(cancer_lm_2005)

Figure 16: Autoplot model diagnostics of the linear model predicting cancer rates from chemical pollutant concentrations.

These diagnostics tell me about what I expected. Observations with low fitted values are essentially useless in terms of predictive ability, and their residuals tend to be quite large relative to the model; around a fitted value of 180, the model becomes more accurate. The leverage exerted by points with low fitted values shows up in both the Q-Q plot and the leverage plot. Two specific points (states) with very low fitted values in 2005 not only pull the model in one direction but essentially invalidate it at low values. While these plots mostly confirmed what I already knew (that the model isn't all that useful), they let me see which observations were leveraging the model.

Making Predictions

The following predictions are made less for accuracy than to demonstrate how predictions are produced, and to show that the model is unable to reasonably predict cancer rates based on its parameters.

# create small dfs
higher_lead_2005 <- data.frame(parameter_name = "Lead PM2.5 LC", arithmetic_mean = 0.2)
lower_lead_2005 <- data.frame(parameter_name = "Lead PM2.5 LC", arithmetic_mean = 0.05)

# make the predictions
high_lead_2005_prediction <- predict(cancer_lm_2005, higher_lead_2005, interval = "prediction")
low_lead_2005_prediction <- predict(cancer_lm_2005, lower_lead_2005, interval = "prediction")

# create df with predictions (as.tibble is deprecated; use as_tibble)
lead_2005_predictions <- rbind(high_lead_2005_prediction, low_lead_2005_prediction)
lead_2005_predictions <- as_tibble(lead_2005_predictions)

# name predictions
lead_2005_prediction_names <- tibble(prediction_type = c("Higher Lead Concentrations (0.2 micrograms)", "Lower Lead Concentrations (0.05 micrograms)"))

# bind predictions together
lead_2005_predictions_as_tibble <- cbind(lead_2005_predictions, lead_2005_prediction_names)

# output lead predictions as a table
pander(lead_2005_predictions_as_tibble[c("prediction_type", "fit", "upr", "lwr")], round = 3, caption = "Predictions of Cancer Rates in 2005 Based on High and Low Lead PM2.5 LC Concentrations", digits = 6)
Predictions of Cancer Rates in 2005 Based on High and Low Lead PM2.5 LC Concentrations
prediction_type fit upr lwr
Higher Lead Concentrations (0.2 micrograms) 183.081 211.847 154.314
Lower Lead Concentrations (0.05 micrograms) 183.09 211.857 154.324
# create small dfs
higher_benzene_2005 <- tibble(parameter_name = "Benzene", arithmetic_mean = 5)
lower_benzene_2005 <- tibble(parameter_name = "Benzene", arithmetic_mean = 2)

# make the predictions
high_benzene_2005_prediction <- predict(cancer_lm_2005, higher_benzene_2005, interval = "prediction")
low_benzene_2005_prediction <- predict(cancer_lm_2005, lower_benzene_2005, interval = "prediction")

# create df with predictions (as.tibble is deprecated; use as_tibble)
benzene_2005_predictions <- rbind(high_benzene_2005_prediction, low_benzene_2005_prediction)
benzene_2005_predictions <- as_tibble(benzene_2005_predictions)

# name predictions
benzene_2005_prediction_names <- tibble(prediction_type = c("Higher Benzene Concentrations (5 ppbc)", "Lower Benzene Concentrations (2 ppbc)"))

# bind predictions together
benzene_2005_predictions_as_tibble <- cbind(benzene_2005_predictions, benzene_2005_prediction_names)

# output predictions as a table
pander(benzene_2005_predictions_as_tibble[c("prediction_type", "fit", "upr", "lwr")], round = 3, caption = "Predictions of Cancer Rates in 2005 Based on High and Low Benzene Concentrations", digits = 6)
Predictions of Cancer Rates in 2005 Based on High and Low Benzene Concentrations
prediction_type fit upr lwr
Higher Benzene Concentrations (5 ppbc) 183.897 212.664 155.13
Lower Benzene Concentrations (2 ppbc) 184.094 212.861 155.327
# create small dfs
higher_vc_2005 <- tibble(parameter_name = "Vinyl chloride", arithmetic_mean = 0.2)
lower_vc_2005 <- tibble(parameter_name = "Vinyl chloride", arithmetic_mean = 0.05)

# make the predictions
high_vc_2005_prediction <- predict(cancer_lm_2005, higher_vc_2005, interval = "prediction")
low_vc_2005_prediction <- predict(cancer_lm_2005, lower_vc_2005, interval = "prediction")

# create df with predictions (as.tibble is deprecated; use as_tibble)
vc_2005_predictions <- rbind(high_vc_2005_prediction, low_vc_2005_prediction)
vc_2005_predictions <- as_tibble(vc_2005_predictions)

# name predictions
vc_2005_prediction_names <- tibble(prediction_type = c("Higher Vinyl Chloride Concentrations (0.2 ppbc)", "Lower Vinyl Chloride Concentrations (0.05 ppbc)"))

# bind predictions together
vc_2005_predictions_as_tibble <- cbind(vc_2005_predictions, vc_2005_prediction_names)

# output predictions as a table
pander(vc_2005_predictions_as_tibble[c("prediction_type", "fit", "upr", "lwr")], round = 3, caption = "Predictions of Cancer Rates in 2005 Based on High and Low Vinyl Chloride Concentrations", digits = 6)
Predictions of Cancer Rates in 2005 Based on High and Low Vinyl Chloride Concentrations
prediction_type fit upr lwr
Higher Vinyl Chloride Concentrations (0.2 ppbc) 182.97 211.737 154.202
Lower Vinyl Chloride Concentrations (0.05 ppbc) 182.98 211.747 154.212
# create small dfs
higher_cadmium_2005 <- tibble(parameter_name = "Cadmium PM2.5 LC", arithmetic_mean = 0.003)
lower_cadmium_2005 <- tibble(parameter_name = "Cadmium PM2.5 LC", arithmetic_mean = 0.001)

# make the predictions
high_cadmium_2005_prediction <- predict(cancer_lm_2005, higher_cadmium_2005, interval = "prediction")
low_cadmium_2005_prediction <- predict(cancer_lm_2005, lower_cadmium_2005, interval = "prediction")

# create df with predictions (as.tibble is deprecated; use as_tibble)
cadmium_2005_predictions <- rbind(high_cadmium_2005_prediction, low_cadmium_2005_prediction)
cadmium_2005_predictions <- as_tibble(cadmium_2005_predictions)

# name predictions
cadmium_2005_prediction_names <- tibble(prediction_type = c("Higher Cadmium PM2.5 LC Concentrations (0.003 micrograms)", "Lower Cadmium PM2.5 LC Concentrations (0.001 micrograms)"))

# bind predictions together
cadmium_2005_predictions_as_tibble <- cbind(cadmium_2005_predictions, cadmium_2005_prediction_names)

# output predictions as a table
pander(cadmium_2005_predictions_as_tibble[c("prediction_type", "fit", "upr", "lwr")], round = 3, caption = "Predictions of Cancer Rates in 2005 Based on High and Low Cadmium PM2.5 LC Concentrations", digits = 6)
Predictions of Cancer Rates in 2005 Based on High and Low Cadmium PM2.5 LC Concentrations
prediction_type fit upr lwr
Higher Cadmium PM2.5 LC Concentrations (0.003 micrograms) 188.682 217.45 159.915
Lower Cadmium PM2.5 LC Concentrations (0.001 micrograms) 188.683 217.45 159.915

I made quantitative predictions, using my linear model, for the four pollutants I mapped spatially and judged to have at least a qualitative relationship to cancer rates. Each prediction shows essentially no change in cancer rates across very different pollutant concentrations; any minute change is the roughly 1.6% adjusted R^2 at work. This confirms that the model has no real predictive capacity.
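The near-identical fits follow directly from the arithmetic: the predicted difference between two scenarios is just the slope times the difference in concentration. A back-of-envelope sketch with an illustrative slope (roughly what the lead table above implies, not the exact fitted coefficient):

```r
# back-of-envelope: predicted change = slope * change in concentration
beta_lead  <- -0.06        # approximate slope implied by the lead table (illustrative)
delta_conc <- 0.2 - 0.05   # gap between the high and low lead scenarios
change     <- beta_lead * delta_conc
change                     # -0.009, invisible next to a baseline near 184
```

With slopes this small, even a fourfold change in concentration moves the prediction by hundredths of a point, which is exactly what the tables show.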

Conclusions

The cause of cancer is an incredibly complex issue that we don't yet fully understand. In all likelihood, cancer arises from thousands of different influences that accumulate over a person's life, and luck and genetics surely play a role in the equation as well, probably a larger one than my model's predictors did. While my model's predictive ability is only a little over 1%, that is still 1% of a model that could be used to predict cancer rates, which I would argue is not bad for a simple analysis such as this one. While the model isn't useful on its own, a larger project building a model informed by many more predictors, including the ones I produced, could be. Atmospheric concentrations of the selected carcinogenic chemical pollutants aren't useful for predicting cancer on their own, but they do appear to play a role, albeit a small one, in the rate of cancer.

Datasets Utilized

Environmental Protection Agency. (2017). Hazardous Air Pollutants. Source: Kaggle. Retrieved April 5, 2022. https://www.kaggle.com/datasets/epa/hazardous-air-pollutants

Centers for Disease Control and Prevention. (2022). Cancer Mortality by State. Source: CDC. Retrieved April 12, 2022. https://www.cdc.gov/nchs/pressroom/sosmap/cancer_mortality/cancer.htm